Preface

For the final exam/project we will develop classification models using several approaches and compare their performance on a new dataset, a subset of one of the datasets used in machine learning common task framework (CTF) competitions. A copy of it, split into training (final-data-train.csv, with the outcome yn available) and test (final-data-test.csv, stripped of the outcome, for prediction purposes only) datasets, is available on our course website in Canvas as a zip-archive of all associated files.

Please note that at the end of this final exam/project you will be asked, in addition to the Rmarkdown and HTML files, to also make predictions for the observations in the test (not training!) dataset and upload them to Canvas as well. The expected format for the file with test predictions is two comma-separated columns with one row per observation in the test dataset: the first column is the observation identifier (column id in the test dataset) and the second column is your best model's prediction for that observation as a yes/no indicator. To illustrate the expected format, the zip archive also contains a couple of examples of test predictions in this format for your reference (the predictions-*.csv files in the predictions-examples sub-folder of the zip-archive).
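For example, writing out test predictions in this format might look like the following sketch (bestFit, the model name, and the probability threshold are placeholders for whatever your best model turns out to be):

```r
library(tidyverse)

# hypothetical: bestFit is your final fitted model, assumed here to return
# class probabilities from predict(..., type = "response")
testData <- read_csv("final-data-test.csv")
predictions <- tibble(
  id = testData$id,
  # the name of this second column is what will identify your model
  my.model.name = ifelse(
    predict(bestFit, newdata = testData, type = "response") > 0.5,
    "yes", "no")
)
write_csv(predictions, "predictions-my-model.csv")
```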

One more time, to reiterate and emphasize: this time your submission must consist of the following three (not just the usual two, Rmd+html) items:

1. the Rmarkdown file with your code and write-up,
2. the HTML file knitted from it, and
3. a CSV file with your best model's predictions for the test dataset.

The teaching team invites you to upload your predictions (just the predictions, just for the test dataset, in the file format shown by the sample files in the zip-archive) to Canvas repeatedly as you work on your models and improve them over the course of this week. At least daily (or more frequently, as we see fit) we will download the predictions uploaded by everyone up to that time and compile an HTML leaderboard of all of them, sorted by their accuracy against the true outcome values for the test dataset (along with their sensitivity, specificity, etc.). The leaderboard will be posted on our course website in Canvas so that everyone in the class can see how the performance of their models compares to the models built by other students. The first version of the leaderboard, posted at the time the final exam is made available, starts with predictions made by the few example files provided in the zip-archive (coin flip, majority vote, etc.; what do you think Charlie Brown is using to make his predictions?). Those should be pretty easy to improve upon.

It is 100% up to you whether you upload your model predictions over the course of this week, how frequently you do so, and what you want your results to be called in the leaderboard posted for everyone in the class to see. We will use the name of the 2nd column in the predictions file (the one containing the yes/no values, not the numerical observation ids) as the model name listed in the leaderboard. If you prefer not to use your own name, choose something else that is sufficiently unique to make it easy to spot your result among all the others. Once again, please check out the sample files of dull predictions (majority vote, coin flip, etc.) we have made available and note how they show up in the leaderboard HTML file already posted on Canvas. Once you are done with the final you are expected to upload the predictions from your best model to Canvas; that is part of your points total, as explained below.

Lastly, the size of this dataset can make some of the modeling techniques run more slowly than what we have typically encountered in this class. You may find it helpful to do some of the exploration and model tuning on multiple smaller random samples while you decide on useful ranges of parameters/modeling choices, and then perform a single final run of the fully debugged code on the complete dataset. Please also see the afterword below on the computational demands of this problem set.
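A minimal sketch of this workflow (the subset size and number of repeats are arbitrary placeholders):

```r
# explore / tune on a few small random subsets of the training data first
set.seed(1234)
for (rep in 1:3) {
  subSample <- rawData[sample(nrow(rawData), 3000), ]
  # ... try candidate models / parameter ranges on subSample here ...
}
# then run the final, debugged modeling code once on the full rawData
```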

Problem 1: univariate and unsupervised analysis (20 points)

Before we get started, I’m going to define a few useful functions for later.

# packages used throughout (MASS and randomForest are loaded before the
# tidyverse so that dplyr's select() is not masked)
library(MASS)           # lda(), qda()
library(leaps)          # regsubsets()
library(dummies)        # dummy.data.frame()
library(randomForest)
library(tidyverse)

calcAcc <- function (cm) {
  # fraction of observations on the diagonal of the confusion matrix
  sum(diag(cm)) / sum(cm)
}

calcSpecificity <- function (cm) {
  # rows of cm are truth, columns are predictions: TN / (TN + FP)
  cm[1, 1] / sum(cm[1, ])
}

calcSensitivity <- function (cm) {
  # TP / (TP + FN); if the model never predicts "yes" the table has a
  # single column, and sensitivity is taken to be zero
  if (ncol(cm) == 2) {
    cm[2, 2] / sum(cm[2, ])
  } else {
    0
  }
}
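As a quick sanity check of these helpers on a toy confusion matrix (rows are the true labels, columns the predictions, matching how table() is used throughout below):

```r
cm <- table(truth = c("no", "no", "no", "yes", "yes"),
            pred  = c("no", "yes", "no", "yes", "no"))
calcAcc(cm)          # 3 of 5 observations classified correctly
calcSensitivity(cm)  # 1 of 2 true "yes" observations predicted "yes"
calcSpecificity(cm)  # 2 of 3 true "no" observations predicted "no"
```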

Download and read training and test data into R and prepare graphical and numerical summaries of it: e.g. histograms of continuous attributes, contingency tables of categorical variables, scatterplots of continuous attributes with some of the categorical variables indicated by color/symbol shape, etc. Whatever you find helpful to think about properties of the data you are about to start using for fitting classification models.

As is often the case for such contests, the attributes in the dataset are blinded, in the sense that no information is available about what they are or what their values mean. The only information available is that the attribute yn is the outcome to be modeled and the attribute id is a unique numerical identifier for each observation. Some of the remaining attributes are clearly categorical (those that are character-valued) and some rather obviously continuous (those with numerical values and a large number of unique values). For several of them it is less clear whether it is best to treat them as continuous or categorical: their values are numerical, but there are relatively few unique values, with many observations taking the same value, so they arguably could be treated either way. Please identify them, reflect on how you prefer to handle them, and describe this in your own words.

Perform principal components analysis of this data (do you need to scale it prior to that? how would you represent multilevel categorical attributes to be used as inputs for PCA?) and plot observations in the space of the first few principal components, indicating levels of some categorical attributes of your choosing by the color/shape of the symbol. Perform univariate assessment of associations between the outcome we will be modeling and each of the attributes (e.g. t-test or logistic regression for continuous attributes; contingency tables/Fisher exact test/\(\chi^2\) test for categorical attributes). Summarize your observations from these assessments: does it appear that there are predictors associated with the outcome yn univariately? Which predictors seem to be more/less relevant?

Okay, let’s load the training dataset and take a look at the data.

rawData <- read_csv("final-data-train.csv") %>% mutate_if(is.character, as.factor)
## Parsed with column specification:
## cols(
##   yn = col_character(),
##   id = col_integer(),
##   cg = col_double(),
##   eg = col_character(),
##   re = col_character(),
##   ms = col_character(),
##   fw = col_double(),
##   cl = col_double(),
##   hw = col_double(),
##   ra = col_character(),
##   nc = col_double(),
##   se = col_character(),
##   ag = col_double(),
##   wc = col_character(),
##   ne = col_character(),
##   oc = col_character(),
##   dr = col_character()
## )
head(rawData)
## # A tibble: 6 x 17
##   yn        id    cg eg    re    ms       fw    cl    hw ra       nc se   
##   <fct>  <int> <dbl> <fct> <fct> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 no    177729  1.72 e2    pzv   lz    3.33   1.41  2.45 ojk     0.3 vrm  
## 2 no    513216  1.83 e2    wto   ls    3.27   1.42  3    kvy     1.2 vrm  
## 3 no    297005  1.61 e2    wib   kh    0.537  1.35  3    gmp     0   vrm  
## 4 no    467346  1.62 e2    wib   kh    1.92   1.36  2.83 kvy     0.5 vrm  
## 5 no    229364  1.59 e2    wto   im    3.27   1.36  2.65 kvy     0   vrm  
## 6 yes   378290  1.93 e2    wto   ls    1.10   1.43  3.32 kvy     0.5 vrm  
## # ... with 5 more variables: ag <dbl>, wc <fct>, ne <fct>, oc <fct>,
## #   dr <fct>
summary(rawData)
##    yn              id               cg              eg          re       
##  no :22165   Min.   :    41   Min.   :1.313   e2     :26526   ldi: 1286  
##  yes: 7257   1st Qu.:149610   1st Qu.:1.656   e8     :  866   pzv: 8983  
##              Median :298736   Median :1.745   e6     :  772   wib: 3443  
##              Mean   :298754   Mean   :1.785   e5     :  335   wto:15710  
##              3rd Qu.:446830   3rd Qu.:1.867   e3     :  325              
##              Max.   :595175   Max.   :4.720   e1     :  318              
##                                               (Other):  280              
##        ms             fw                  cl              hw       
##  lz     :9356   Min.   : 0.000284   Min.   :0.000   Min.   :0.000  
##  kh     :7255   1st Qu.: 0.646817   1st Qu.:1.338   1st Qu.:2.830  
##  im     :4142   Median : 1.385567   Median :1.369   Median :3.160  
##  ls     :2891   Mean   : 1.770069   Mean   :1.286   Mean   :3.004  
##  ow     :2491   3rd Qu.: 2.444534   3rd Qu.:1.394   3rd Qu.:3.460  
##  hf     :1593   Max.   :26.224451   Max.   :1.632   Max.   :3.740  
##  (Other):1694                                                      
##    ra              nc           se              ag          wc       
##  erm:   33   Min.   :0.0000   jtl: 3038   Min.   :0.1414   cz:  114  
##  gmp: 6074   1st Qu.:0.2000   vrm:26384   1st Qu.:0.3162   gf: 1279  
##  kvy:21454   Median :0.3000               Median :0.3742   li:  794  
##  ojk:  578   Mean   :0.4152               Mean   :0.3746   pv: 9912  
##  tdg: 1283   3rd Qu.:0.6000               3rd Qu.:0.4000   qx:   37  
##              Max.   :1.8000               Max.   :0.8185   xv:17286  
##                                                                      
##    ne              oc        dr       
##  jle: 6416   lr     :9637   oi:23467  
##  kcr: 8487   al     :9568   qe: 5955  
##  rql: 1929   ey     :3331             
##  ydq:12590   jz     :2406             
##              jl     :1376             
##              rk     : 844             
##              (Other):2260
rawData %>% select_if(is.numeric) %>% select(-id) %>% pairs(col = rawData$yn)

Columns ag and cg appear to correlate fairly well with the target. Something interesting is happening in column cl, where many rows have the value zero and the non-zero values fall between 1 and about 1.7. It looks like column hw would benefit from a log transformation. I don’t think it would benefit from conversion to categorical, though, since there seems to be some correlation between its value and the target.
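Two quick checks on the numeric columns (sketches): counting distinct values helps identify the borderline numeric attributes that might instead be treated as categorical, and a histogram comparison shows the effect of the proposed log transformation of hw (the offset avoids log(0), since hw has zeroes).

```r
# columns with only a handful of distinct values are candidates for
# categorical treatment
rawData %>% select_if(is.numeric) %>% summarise_all(n_distinct)

# before/after view of the log transform of hw
par(mfrow = c(1, 2))
hist(rawData$hw, main = "hw", xlab = "hw")
hist(log(rawData$hw + 1e-6), main = "log(hw + 1e-6)", xlab = "log hw")
```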

for (catRow in select_if(rawData, is.factor)) {
  print(table(rawData$yn, catRow))
}
##      catRow
##          no   yes
##   no  22165     0
##   yes     0  7257
##      catRow
##          e1    e2    e3    e4    e5    e6    e7    e8
##   no     42 21330   116    30   166   240    26   215
##   yes   276  5196   209   187   169   532    37   651
##      catRow
##         ldi   pzv   wib   wto
##   no    833  6345  3043 11944
##   yes   453  2638   400  3766
##      catRow
##         ay   hf   im   kh   ls   lz   ow   yd
##   no   562 1047 3128 5751 1977 7438 1727  535
##   yes  312  546 1014 1504  914 1918  764  285
##      catRow
##         erm   gmp   kvy   ojk   tdg
##   no     10  4527 16272   448   908
##   yes    23  1547  5182   130   375
##      catRow
##         jtl   vrm
##   no   1038 21127
##   yes  2000  5257
##      catRow
##          cz    gf    li    pv    qx    xv
##   no     51   502   638  7927    11 13036
##   yes    63   777   156  1985    26  4250
##      catRow
##        jle  kcr  rql  ydq
##   no  4803 6354 1445 9563
##   yes 1613 2133  484 3027
##      catRow
##         al   ey   jl   jz   lr   me   pw   qt   rk   ry   sb   wg   zo
##   no  8337 2836 1024 1824 6337   45  489  157  611   29  216  260    0
##   yes 1231  495  352  582 3300   39  231  104  233   25   78  565   22
##      catRow
##          oi    qe
##   no  17633  4532
##   yes  5834  1423

The first thing worth noting in the categorical features is that the target no is about three times more common than yes. Feature se seems fairly predictive and has only two levels. It also seems that most target-no samples have eg equal to e2, which may be helpful. There may be correlations with the other factors, but it is difficult to tell given both the class imbalance and the larger number of levels.
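These impressions can be checked formally with the univariate tests mentioned in the problem statement; a sketch for se as an example (Fisher's exact test is practical here since se forms a 2x2 table against the outcome, while the \(\chi^2\) test also handles the attributes with more levels):

```r
# univariate association between the outcome and the two-level feature se
chisq.test(table(rawData$yn, rawData$se))
fisher.test(table(rawData$yn, rawData$se))
```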

rawData %>% mutate(eg2 = eg == "e2", yn = yn == "yes") %>% select(yn, eg2) %>% cor()
##             yn        eg2
## yn   1.0000000 -0.3564462
## eg2 -0.3564462  1.0000000

It definitely appears that e2 can provide some value in predicting yn.

rawData %>% mutate(sevrm = se == "vrm", yn = yn == "yes") %>% select(yn, sevrm) %>% cor()
##               yn      sevrm
## yn     1.0000000 -0.3240698
## sevrm -0.3240698  1.0000000

And se can provide some value as well.

In order to fix column cl, I’ll use correlated features to fit a linear regression on the non-zero-valued samples of cl and predict values to replace the zero-valued samples.

rawData %>% select_if(is.numeric) %>% select(-id) %>% filter(cl != 0) %>% 
  cor() %>% .[,"cl"] %>% sort(decreasing = TRUE)
##          cl          ag          cg          nc          hw          fw 
##  1.00000000  0.59097593  0.43125803  0.07142344  0.00259574 -0.01927038

Looks like cg and ag are our best bet.

clLRFit <- glm(cl ~ cg + ag, data = filter(rawData, cl != 0))
newcl <- predict(clLRFit, newdata = filter(rawData, cl == 0))

summary(clLRFit)
## 
## Call:
## glm(formula = cl ~ cg + ag, data = filter(rawData, cl != 0))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.27626  -0.02668   0.01131   0.02147   0.17417  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.186429   0.001828  649.06   <2e-16 ***
## cg          0.015981   0.001329   12.03   <2e-16 ***
## ag          0.421856   0.005008   84.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.001220637)
## 
##     Null deviance: 51.971  on 27564  degrees of freedom
## Residual deviance: 33.643  on 27562  degrees of freedom
## AIC: -106685
## 
## Number of Fisher Scoring iterations: 2
plot(clLRFit)

Not an incredible fit, but it will do better than all zeroes.

Now I’ll create logical features that flag when eg == "e2", se == "vrm", and cl == 0, log-transform hw, and replace the zero-valued cl samples with the predictions above. I also create a tbl where the factor columns have been replaced with numeric dummy columns.

trainData <- rawData %>%
  mutate(
    e2 = eg == "e2",
    sev = se == "vrm",
    clz = cl == 0,
    hw = log(hw + 1e-6)
    )
trainData[trainData$cl == 0, "cl"] <- newcl

trainDataDummies <- as.tibble(dummy.data.frame(as.data.frame(select(trainData, -yn)))) %>%
  bind_cols(select(trainData, yn))

With that done, let’s take a look at what a principal components analysis will look like on this data.

pcaFit <- prcomp(select(trainDataDummies, -yn), scale. = TRUE)
plot(pcaFit)

The explained variance by each dimension does not drop off all that quickly.
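This can be quantified with the cumulative proportion of variance explained (a sketch):

```r
# proportion of total variance captured by the first k principal components
cumsum(pcaFit$sdev^2) / sum(pcaFit$sdev^2)
```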

biplot(pcaFit, pc.biplot = TRUE)

In the first two dimensions, you see large vector components from wcpv, e2, wcxv, cl, and ag.

factorColNames <- colnames(select_if(trainData, is.factor))

for (factorColName in factorColNames) {
  pairs(pcaFit$x[, 1:3], col = trainData[[factorColName]])
  title(factorColName)
}

The data takes an interesting shape. There is one large mass of points and one much smaller mass; the small mass all shares the same yn value. The larger mass also seems to be split into two attached sub-masses. Some of the other categorical features also seem to split the data nicely, such as eg, ra, se, and wc.

Let’s try fitting some single variable logistic regressions on this data to get a sense for which numeric features have predictive power on their own.

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLRSV <- tibble()

numericColNames <- colnames(select(select_if(trainData, is.numeric), -id))

for (fold in 1:k) {
  for (numericColName in numericColNames) {
    singleVarData <- bind_cols(
      var = trainData[, numericColName],
      yn = trainData[, "yn"]
    ) %>% mutate(yn = as.numeric(yn) - 1)
    trainDataFit <- singleVarData[kfInd != fold, ]
    trainDataVal <- singleVarData[kfInd == fold, ]
  
    lrFit <- glm(
      yn ~ ., 
      data = trainDataFit)
    
    lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
    lrCM <- table(trainDataVal$yn, round(lrPredict))
  
    xvalResultsLRSV <- rbind(xvalResultsLRSV, tibble(
      fold = fold,
      acc = calcAcc(lrCM),
      sens = calcSensitivity(lrCM),
      spec = calcSpecificity(lrCM),
      model = paste("lr", numericColName, sep = "-")
    ))
  }
}
xvalResultsLRSV %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

Some, such as ag and cl, are actually fairly decent on their own. Several have very low sensitivity. Column nc is sort of middling. Let’s see if a more complicated analysis can provide a better result.

Problem 2: logistic regression (20 points)

Develop logistic regression model of the outcome yn as a function of multiple predictors in the model. Which variables are significantly associated with the outcome? Test model performance on multiple splits of data into training and test subsets and summarize it in terms of accuracy/error/sensitivity/specificity.

Which features have high correlation with the target?

trainData %>% 
  select_if(is.numeric) %>% select(-id) %>% cbind(yn = as.numeric(trainData$yn)) %>% 
  cor() %>% .["yn", ] %>% sort(decreasing = TRUE)
##           yn           cg           ag           nc           cl 
##  1.000000000  0.484006546  0.343500911  0.256036764  0.160414523 
##           hw           fw 
##  0.089530953 -0.002996787

There is decent correlation between yn and the features cg, ag, and nc.

trainData %>% 
  select_if(is.numeric) %>% select(-id) %>% cbind(yn = as.numeric(trainData$yn)) %>% 
  cor()
##             cg           fw          cl          hw          nc
## cg  1.00000000 -0.006696750  0.44554281  0.29491908 0.201160808
## fw -0.00669675  1.000000000 -0.01859891 -0.00201130 0.002041653
## cl  0.44554281 -0.018598913  1.00000000  0.01607019 0.076810683
## hw  0.29491908 -0.002011300  0.01607019  1.00000000 0.028801311
## nc  0.20116081  0.002041653  0.07681068  0.02880131 1.000000000
## ag  0.65612819 -0.011617617  0.60919322  0.01339333 0.190846319
## yn  0.48400655 -0.002996787  0.16041452  0.08953095 0.256036764
##             ag           yn
## cg  0.65612819  0.484006546
## fw -0.01161762 -0.002996787
## cl  0.60919322  0.160414523
## hw  0.01339333  0.089530953
## nc  0.19084632  0.256036764
## ag  1.00000000  0.343500911
## yn  0.34350091  1.000000000

Some of these features are also fairly well correlated with each other, especially cg and ag.
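One way to quantify this collinearity is with variance inflation factors (a sketch, assuming the car package is available):

```r
library(car)

# VIF per predictor in the logistic regression; values well above 1
# indicate predictors that are largely explained by the others
vif(glm(yn ~ cg + fw + cl + hw + nc + ag, data = trainData, family = binomial))
```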

Let’s try our luck and use all of the numeric and logical features.

lrFit <- glm(
  yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz, 
  data = trainData, family = binomial)

summary(lrFit)
## 
## Call:
## glm(formula = yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz, 
##     family = binomial, data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1973  -0.5002  -0.2972  -0.1190   3.3611  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.301506   0.652877  -8.120 4.65e-16 ***
## cg           7.779193   0.161731  48.099  < 2e-16 ***
## fw          -0.001509   0.011694  -0.129  0.89730    
## cl          -4.522904   0.531707  -8.506  < 2e-16 ***
## hw          -0.023662   0.011509  -2.056  0.03979 *  
## nc           1.217986   0.045207  26.943  < 2e-16 ***
## ag           1.729410   0.538251   3.213  0.00131 ** 
## e2TRUE      -3.133920   0.058001 -54.032  < 2e-16 ***
## sevrm       -2.548837   0.055781 -45.693  < 2e-16 ***
## clzTRUE      0.283731   0.070998   3.996 6.43e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32872  on 29421  degrees of freedom
## Residual deviance: 18976  on 29412  degrees of freedom
## AIC: 18996
## 
## Number of Fisher Scoring iterations: 6
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLR <- tibble()

for (fold in 1:k) {
  trainDataFit <- trainData[kfInd != fold, ]
  trainDataVal <- trainData[kfInd == fold, ]

  lrFit <- glm(
    yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz, 
    data = trainDataFit, family = binomial)
  
  lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
  lrCM <- table(trainDataVal$yn, round(lrPredict))

  xvalResultsLR <- rbind(xvalResultsLR, tibble(
    fold = fold,
    acc = calcAcc(lrCM),
    sens = calcSensitivity(lrCM),
    spec = calcSpecificity(lrCM),
    model = "lr"
  ))
}
xvalResultsLR %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

This logistic regression performs surprisingly well, scoring around 90% validation accuracy with similar levels of sensitivity and specificity. Many of the predictors are significant, including cg, cl, nc, and e2.

Extra points problem: interaction terms (5 extra points)

Assess the impact/significance of pairwise interaction terms for all pairwise combinations of covariates used in the model and report the top ten that most significantly improve model fit.

In order to assess the value of interaction terms in general, let’s look at what a logistic regression fit looks like using all features and all interaction terms.

numSubsetData <- bind_cols(
  select(trainData, yn),
  select_if(trainData, is.numeric),
) %>% select(-id)

lmFit <- glm(yn ~ . + .*., data = numSubsetData, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(lmFit)
## 
## Call:
## glm(formula = yn ~ . + . * ., family = binomial, data = numSubsetData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.7107  -0.6219  -0.4481  -0.2226   2.9260  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.087e+01  6.371e+00   4.846 1.26e-06 ***
## cg          -1.075e+01  4.355e+00  -2.469 0.013562 *  
## fw           9.310e-01  3.634e-01   2.562 0.010408 *  
## cl          -2.188e+01  5.317e+00  -4.115 3.87e-05 ***
## hw          -4.255e-01  3.672e-01  -1.159 0.246556    
## nc           7.372e+00  1.352e+00   5.453 4.97e-08 ***
## ag          -5.422e+01  1.186e+01  -4.571 4.85e-06 ***
## cg:fw       -1.718e-01  8.333e-02  -2.062 0.039227 *  
## cg:cl        5.930e+00  3.402e+00   1.743 0.081291 .  
## cg:hw        2.838e-01  1.392e-01   2.039 0.041475 *  
## cg:nc        1.112e+00  3.319e-01   3.349 0.000810 ***
## cg:ag        2.125e+01  2.712e+00   7.835 4.69e-15 ***
## fw:cl       -6.171e-01  2.950e-01  -2.092 0.036480 *  
## fw:hw        3.397e-03  6.369e-03   0.533 0.593804    
## fw:nc        5.542e-04  2.344e-02   0.024 0.981139    
## fw:ag        5.854e-01  2.879e-01   2.034 0.041998 *  
## cl:hw        2.069e-01  2.842e-01   0.728 0.466671    
## cl:nc       -4.684e+00  1.090e+00  -4.296 1.74e-05 ***
## cl:ag        1.530e+01  8.473e+00   1.806 0.070872 .  
## hw:nc       -6.918e-02  2.344e-02  -2.952 0.003159 ** 
## hw:ag       -7.496e-01  2.102e-01  -3.567 0.000362 ***
## nc:ag       -4.662e+00  1.068e+00  -4.367 1.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32872  on 29421  degrees of freedom
## Residual deviance: 24330  on 29400  degrees of freedom
## AIC: 24374
## 
## Number of Fisher Scoring iterations: 6

Several have high predictive power. Using feature selection methods, which ones should we pick?

# taken from homework 5, which was in turn taken from ISLR lab 6.5.3
predict.regsubsets <- function (object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
selectMethods <- c("exhaustive", "backward", "forward", "seqrep")
fitMetrics <- c("rsq","rss","adjr2","bic")
nVars <- ncol(numSubsetData) - 1 + ncol(combn(colnames(select(numSubsetData, -yn)), 2))
selectWhichInteract <- NULL
selectMetricsInteract <- NULL

for (method in selectMethods) {
  # fit using each method of determining subsets and save summary of results
  selectFit <- regsubsets(yn ~ . + .*., numSubsetData, method=method, nvmax=nVars)
  selectSummary <- summary(selectFit)
  
  # extract which features are used for each number of variables and save with method
  selectWhich <- as.tibble(selectSummary$which)
  selectWhich$method <- method
  selectWhich$vars <- 1:nVars
  selectWhichInteract <- rbind(selectWhichInteract, selectWhich)
  
  # extract fit quality metrics for each number of variables and save with method
  selectMetrics <- as.tibble(selectSummary[fitMetrics])
  selectMetrics$method <- method
  selectMetrics$vars <- 1:nVars
  selectMetricsInteract <- rbind(selectMetricsInteract, selectMetrics)
}

selectMetricsInteract <- selectMetricsInteract %>% 
  gather(key = metric, value = value, fitMetrics)
ggplot(selectMetricsInteract, aes(x = vars,y = value,shape = method,colour = method)) + 
  geom_path() + 
  geom_point() + 
  facet_wrap(~metric,scales="free") + 
  theme(legend.position="top")

selectWhichInteract %>% 
  gather(key = param, value = selected, -c(method, vars)) %>%
  ggplot(aes(x = vars, y = param, fill = selected)) + 
  geom_tile() + 
  facet_wrap(~method)

This chart is a little difficult to read, but it is possible to tell that several features, such as cg, nc, cl:nc, and hw, are often selected first. Which ones are the most commonly selected at any level?

selectWhichInteract %>% 
  gather(key = param, value = selected, -c(method, vars)) %>% 
  group_by(param) %>% 
  summarize(sum = sum(selected)) %>%
  arrange(desc(sum)) %>%
  head(15) %>%
  ggplot(aes(x = fct_reorder(param, sum, .desc = TRUE), y = sum)) + geom_col()

It appears the selections drop off after cg:hw. Let’s try fitting a logistic regression with all features up to that point.

lrFit <- glm(
  yn ~ cg + nc + cl*nc + hw + cl + cg*ag + cl*ag + ag + cg*hw + e2 + clz + se, 
  data = trainData, family = binomial)

summary(lrFit)
## 
## Call:
## glm(formula = yn ~ cg + nc + cl * nc + hw + cl + cg * ag + cl * 
##     ag + ag + cg * hw + e2 + clz + se, family = binomial, data = trainData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.8847  -0.4904  -0.3004  -0.1191   3.4479  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  17.68636    3.96518   4.460 8.18e-06 ***
## cg            3.33068    1.00851   3.303 0.000958 ***
## nc           10.92184    1.43553   7.608 2.78e-14 ***
## cl          -15.50187    3.25640  -4.760 1.93e-06 ***
## hw            0.04103    0.17363   0.236 0.813215    
## ag          -65.55824    9.60615  -6.825 8.82e-12 ***
## e2TRUE       -3.09188    0.05792 -53.379  < 2e-16 ***
## clzTRUE       0.32794    0.07147   4.588 4.47e-06 ***
## sevrm        -2.52787    0.05566 -45.419  < 2e-16 ***
## nc:cl        -7.04095    1.04328  -6.749 1.49e-11 ***
## cg:ag        11.00178    2.52944   4.349 1.36e-05 ***
## cl:ag        34.45720    8.05609   4.277 1.89e-05 ***
## cg:hw        -0.03663    0.11597  -0.316 0.752139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 32872  on 29421  degrees of freedom
## Residual deviance: 18845  on 29409  degrees of freedom
## AIC: 18871
## 
## Number of Fisher Scoring iterations: 6
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLRInt <- tibble()

for (fold in 1:k) {
  trainDataFit <- trainData[kfInd != fold, ]
  trainDataVal <- trainData[kfInd == fold, ]

  lrFit <- glm(
    yn ~ cg + nc + cl*nc + hw + cl + cg*ag + cl*ag + ag + cg*hw + e2 + clz + se, 
    data = trainDataFit, family = binomial)
  
  lrPredict <- predict(lrFit, newdata = trainDataVal, type = "response")
  lrCM <- table(trainDataVal$yn, round(lrPredict))

  xvalResultsLRInt <- rbind(xvalResultsLRInt, tibble(
    fold = fold,
    acc = calcAcc(lrCM),
    sens = calcSensitivity(lrCM),
    spec = calcSpecificity(lrCM),
    model = "lr-int"
  ))
}
xvalResultsLRInt %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

The fit is good, but not significantly better than the logistic regression without interactions. Let’s create a tbl with these extra features just in case another model can make better use of them.

moreData <- trainDataDummies %>%
  mutate(
    clnc = cl * nc,
    cgag = cg * ag,
    clag = cl * ag, 
    cghw = cg * hw
  )

Problem 3: linear discriminant analysis (15 points)

Fit linear discriminant analysis model of the outcome yn as a function of the rest of covariates in the dataset. Feel free to decide whether you want to use all of them or a subset of those. Test resulting model performance on multiple splits of the data into training and test subsets, summarize it in terms of accuracy/error/sensitivity/specificity and compare them to those obtained for logistic regression.

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsLDA <- tibble()

for (fold in 1:k) {
  trainDataFit <- trainData[kfInd != fold, ]
  trainDataVal <- trainData[kfInd == fold, ]

  ldaFit <- lda(
    yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz, 
    data = trainDataFit)
  
  ldaPredict <- predict(ldaFit, newdata = trainDataVal)
  ldaCM <- table(trainDataVal$yn, as.numeric(ldaPredict$x > 0))

  xvalResultsLDA <- rbind(xvalResultsLDA, tibble(
    fold = fold,
    acc = calcAcc(ldaCM),
    sens = calcSensitivity(ldaCM),
    spec = calcSpecificity(ldaCM),
    model = "lda"
  ))
}
xvalResultsLDA %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

LDA performs poorly in comparison to other classification methods on this dataset. The sensitivity especially is very low.

Extra points problem: quadratic discriminant analysis (5 extra points)

In our experience, attempting to fit a quadratic discriminant analysis model of the categorical outcome yn on this data results in a rank-deficiency-related error. Determine how to correct this error and report the resulting model training and test error/accuracy/etc. and how it compares to LDA and logistic regression above.

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsQDA <- tibble()

for (fold in 1:k) {
  trainDataFit <- trainData[kfInd != fold, ]
  trainDataVal <- trainData[kfInd == fold, ]

  qdaFit <- qda(
    yn ~ cg + fw + cl + hw + nc + ag + e2 + se + clz, 
    data = trainDataFit)
  
  qdaPredict <- predict(qdaFit, newdata = trainDataVal)
  qdaCM <- table(trainDataVal$yn, qdaPredict$class)

  xvalResultsQDA <- rbind(xvalResultsQDA, tibble(
    fold = fold,
    acc = calcAcc(qdaCM),
    sens = calcSensitivity(qdaCM),
    spec = calcSpecificity(qdaCM),
    model = "qda"
  ))
}
xvalResultsQDA %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

QDA performs better than LDA, but not by much.
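The cross-validated results collected so far can also be put side by side in a single plot (a sketch reusing the per-model tibbles built above):

```r
# combine per-fold metrics from all fitted models into one comparison plot
bind_rows(xvalResultsLR, xvalResultsLRInt, xvalResultsLDA, xvalResultsQDA) %>%
  gather(key = metric, value = value, -c(fold, model)) %>%
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()
```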

Problem 4: random forest (15 points)

Develop random forest model of outcome yn. Present variable importance plots and comment on relative importance of different attributes in the model. Did attributes showing up as more important in random forest model also appear as significantly associated with the outcome by logistic regression? Test model performance on multiple splits of data into training and test subsets, compare test and out-of-bag error estimates, summarize model performance in terms of accuracy/error/sensitivity/specificity and compare to the performance of logistic regression and LDA models above.

Let’s try fitting one on a sampled dataset first to see what the important features are.

smallSample <- trainData[sample(1:nrow(trainData), nrow(trainData) / 10, replace = TRUE), ]

rfFit <- randomForest(select(smallSample, -yn), smallSample$yn)
rfFit$importance %>% as.tibble(rownames = "feature") %>% 
  arrange(desc(MeanDecreaseGini)) %>% head(15) %>%
  ggplot(aes(x = fct_reorder(feature, MeanDecreaseGini, .desc = TRUE), y = MeanDecreaseGini)) + 
  geom_col()

By far the most important feature is cg. After that, e2, eg, ag, oc, cl, se, … are also selected as important features.

smallSample <- trainData[sample(1:nrow(trainData), nrow(trainData) / 2, replace = TRUE), ]

tune.rf <- tuneRF(select(smallSample, -yn), smallSample$yn, improve = 0, stepFactor = 1.5)
## mtry = 4  OOB error = 4.71% 
## Searching left ...
## mtry = 3     OOB error = 5.17% 
## -0.0981241 0 
## Searching right ...
## mtry = 6     OOB error = 4.61% 
## 0.02164502 0 
## mtry = 9     OOB error = 4.72% 
## -0.02507375 0

It appears, based on running this a number of times, that the optimal mtry is somewhere between 6 and 9. Let’s call it 8.

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsRF <- tibble()

for (fold in 1:k) {
  trainDataFit <- trainData[kfInd != fold, ]
  trainDataVal <- trainData[kfInd == fold, ]

  rfFit <- randomForest(select(trainDataFit, -yn), trainDataFit$yn, mtry = 8)
  
  rfPredict <- predict(rfFit, newdata = select(trainDataVal, -yn))
  rfCM <- table(trainDataVal$yn, rfPredict)

  xvalResultsRF <- rbind(xvalResultsRF, tibble(
    fold = fold,
    acc = calcAcc(rfCM),
    sens = calcSensitivity(rfCM),
    spec = calcSpecificity(rfCM),
    model = "rf"
  ))
}
xvalResultsRF %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

The random forest performs very well: better than 92% accuracy, with slightly higher sensitivity than specificity.
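The problem also asks for a comparison of test and out-of-bag error estimates; randomForest computes the OOB estimate as it fits, so no extra cross-validation is needed to obtain it. A minimal sketch on simulated stand-in data (`toy` below is made up; in the real analysis trainData would take its place):

```r
# Sketch: read the built-in out-of-bag error estimate off a randomForest fit,
# for comparison with the cross-validated test error computed above.
library(randomForest)
set.seed(1)
toy <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
toy$yn <- factor(ifelse(toy$x1 + 0.5 * rnorm(500) > 0, "yes", "no"))

fit <- randomForest(yn ~ ., data = toy)
oobErr <- fit$err.rate[fit$ntree, "OOB"]  # OOB error after all trees are grown
oobErr                                    # expect this close to the CV test error
```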

Problem 5: SVM (20 points)

Develop an SVM model of the categorical outcome yn, deciding on the choice of kernel, cost, etc. that appears to yield better performance. Test model performance on multiple splits of the data into training and test subsets, summarize model performance in terms of accuracy/error/sensitivity/specificity, and compare it to the performance of the rest of the models developed above (logistic regression, LDA, random forest).

I’ll just use a small subset of the data to tune the svm parameters.

smallSample <- trainData[sample(1:nrow(trainData), nrow(trainData) / 10, replace = TRUE), ]
smallSample <- smallSample %>% select_if(negate(is.factor)) %>% 
  bind_cols(smallSample %>% select(yn)) %>% select(-id)
# smallSample <- smallSample[, colSums(smallSample %>% select_if(is.numeric)) != 0]

costs <- c(0.3, 0.5, 0.8, 1)

tune.out <- tune(svm, yn ~ ., data = smallSample, 
                 kernel = "linear", scale = TRUE, 
                 ranges = list(cost = costs))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.3
## 
## - best performance: 0.1101314 
## 
## - Detailed performance results:
##   cost     error dispersion
## 1  0.3 0.1101314 0.01951401
## 2  0.5 0.1104716 0.01843488
## 3  0.8 0.1101314 0.01828987
## 4  1.0 0.1111519 0.01890601

After running this a few times, a cost of 0.8 seems to be the consensus pick, even though this particular run slightly favored 0.3 (the differences are well within the dispersion).

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(trainData)))
xvalResultsSVM <- tibble()

for (fold in 1:k) {
  # Build both folds the same way: keep the numeric predictors plus the
  # factor outcome yn, and drop the id column.
  trainDataFit <- trainData[kfInd != fold, ] %>% 
    select_if(negate(is.factor)) %>% 
    bind_cols(trainData[kfInd != fold, ] %>% select(yn)) %>% select(-id)
  trainDataVal <- trainData[kfInd == fold, ] %>% 
    select_if(negate(is.factor)) %>%
    bind_cols(trainData[kfInd == fold, ] %>% select(yn)) %>% select(-id)

  svmFit <- svm(yn ~ ., 
                data = trainDataFit, kernel = "linear", scale = TRUE,
                cost = 0.8)
  
  svmPredict <- predict(svmFit, trainDataVal)
  svmCM <- table(trainDataVal$yn, svmPredict)

  xvalResultsSVM <- rbind(xvalResultsSVM, tibble(
    fold = fold,
    acc = calcAcc(svmCM),
    sens = calcSensitivity(svmCM),
    spec = calcSpecificity(svmCM),
    model = "lin-svm"
  ))
}
xvalResultsSVM %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

The SVM appears to perform well, but still not as well as the random forest.

Extra 10 points: neural network model

Experiment with fitting neural network models of categorical outcome yn for this data and evaluate their performance on different splits of the data into training and test. Compare model performance to that for the rest of classifiers developed above.

I think the neuralnet package is a bit of a joke, so I used the keras interface for R instead. First, because the classes are imbalanced, I'll oversample the “yes” rows to ensure there's an equal number of observations with each target response.

yesRows <- moreData %>% filter(yn == "yes") 
resampData <- yesRows %>% 
  .[sample(1:nrow(yesRows), nrow(moreData) - 2 * nrow(yesRows), replace=TRUE), ] %>% 
  bind_rows(moreData) %>% 
  mutate(ynn = as.numeric(yn) - 1) %>% 
  select_if(negate(is.factor)) %>% 
  select(-id)

resampData %>% count(ynn)
## # A tibble: 2 x 2
##     ynn     n
##   <dbl> <int>
## 1     0 22165
## 2     1 22165

After playing around for a bit, I settled on this model structure, although, to be fair, changing the hyperparameters only moves the accuracy at the margins.

model <- keras_model_sequential()
model %>% 
  layer_dense(units = 128, activation = "relu", input_shape = ncol(resampData) - 1, name = "dense_1") %>%
  layer_dropout(rate = 0.3, name = "dropout_1") %>%
  layer_dense(units = 32, activation = "relu", name = "dense_2") %>%
  layer_dropout(rate = 0.3, name = "dropout_2") %>%
  layer_dense(units = 1, activation = "sigmoid", name = "dense_final")

summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## dense_1 (Dense)                  (None, 128)                   8448        
## ___________________________________________________________________________
## dropout_1 (Dropout)              (None, 128)                   0           
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 32)                    4128        
## ___________________________________________________________________________
## dropout_2 (Dropout)              (None, 32)                    0           
## ___________________________________________________________________________
## dense_final (Dense)              (None, 1)                     33          
## ===========================================================================
## Total params: 12,609
## Trainable params: 12,609
## Non-trainable params: 0
## ___________________________________________________________________________

It’s a binary classification problem, so I’m using a sigmoid activation function on the final neuron, binary crossentropy as my loss function and the adam optimizer.

model %>% compile(
  loss = loss_binary_crossentropy,
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x = as.matrix(resampData %>% select(-ynn)), 
  y = as.matrix(resampData %>% select(ynn)), 
  epochs = 50, batch_size = 512, validation_split = 0.2
)
k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(resampData)))
xvalResultsNN <- tibble()

for (fold in 1:k) {
  trainDataFit <- resampData[kfInd != fold, ]
  trainDataVal <- resampData[kfInd == fold, ]
  
  model <- keras_model_sequential()
  model %>% 
    layer_dense(units = 128, activation = "relu", input_shape = ncol(trainDataFit) - 1, name = "dense_1") %>%
    layer_dropout(rate = 0.3, name = "dropout_1") %>%
    layer_dense(units = 32, activation = "relu", name = "dense_2") %>%
    layer_dropout(rate = 0.3, name = "dropout_2") %>%
    layer_dense(units = 1, activation = "sigmoid", name = "dense_final")
  
  model %>% compile(
    loss = loss_binary_crossentropy,
    optimizer = optimizer_adam(),
    metrics = c('accuracy')
  )
  
  history <- model %>% fit(
    x = as.matrix(trainDataFit %>% select(-ynn)), 
    y = as.matrix(trainDataFit %>% select(ynn)), 
    epochs = 50, batch_size = 512, verbose = 0
  )
  
  nnPredict <- model %>% predict_classes(as.matrix(trainDataVal %>% select(-ynn)))
  nnCM <- table(trainDataVal$ynn, nnPredict)

  xvalResultsNN <- rbind(xvalResultsNN, tibble(
    fold = fold,
    acc = calcAcc(nnCM),
    sens = calcSensitivity(nnCM),
    spec = calcSpecificity(nnCM),
    model = "nn"
  ))
}
xvalResultsNN %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

The results are pretty good, with fairly high sensitivity; some regularization could help balance sensitivity and specificity.

xvalResults <- bind_rows(
  xvalResultsLR,
  xvalResultsLRInt,
  xvalResultsLDA,
  xvalResultsQDA,
  xvalResultsRF,
  xvalResultsSVM,
  xvalResultsNN
)

xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
  geom_boxplot()

In comparison with the other methods, the neural network falls a little short: logistic regression, the linear-kernel SVM, and the random forest all have higher cross-validated test accuracy.

Problem 6: predictions for test dataset (10 points)

Problem 6a: compare logistic regression, LDA, random forest and SVM model performance (3 points)

Compare performance of the models developed above (logistic regression, LDA, random forest, SVM) in terms of their accuracy, error and sensitivity/specificity. Comment on differences and similarities between them.

xvalResults <- bind_rows(
  xvalResultsLR,
  xvalResultsLRInt,
  xvalResultsLDA,
  xvalResultsQDA,
  xvalResultsRF,
  xvalResultsSVM,
  xvalResultsNN
)

xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
  geom_boxplot()

Right out of the gate, it's easy to see how much worse the LDA and QDA methods perform on this dataset. The overall accuracies of the logistic regressions, the SVM, and the neural net are all fairly similar, although the neural net is the only one of those three with higher sensitivity than specificity.

All in all, the random forest (without additional tuning) is the best model, and the one I'll use to make my test-set predictions.

Problem 6b: make predictions for the test dataset (3 points)

Decide on the model that performs the best and use it to make predictions for the test dataset. This is the dataset that is provided separately from training data without the outcome yn that we are modeling here. Upload resulting predictions in comma-separated values (CSV) format into the Canvas website. Please check sample files with test dataset predictions for the expected format of the *.csv file with predictions.

Before making predictions, let's take one last look at the random forest, this time adding the factor columns and the interaction terms discovered with logistic regression as extra features.

rfFit <- randomForest(select(moreData, -yn), moreData$yn)
smallSample <- moreData[sample(1:nrow(moreData), nrow(moreData) / 5, replace = TRUE), ]

tune.rf <- tuneRF(select(smallSample, -yn), smallSample$yn, 
                  mtryStart = 14, improve = 0, stepFactor = 1.5)
## mtry = 14  OOB error = 7.55% 
## Searching left ...
## mtry = 10    OOB error = 7.92% 
## -0.04954955 0 
## Searching right ...
## mtry = 21    OOB error = 7.51% 
## 0.004504505 0 
## mtry = 31    OOB error = 7.63% 
## -0.0158371 0

rfFit$importance %>% as.tibble(rownames = "feature") %>% 
  arrange(desc(MeanDecreaseGini)) %>%
  head(20) %>%
  ggplot(aes(x = fct_reorder(feature, MeanDecreaseGini, .desc = TRUE), y = MeanDecreaseGini)) + 
  geom_col()

k <- 10
kfInd <- sample(rep(c(1:k),length.out=nrow(moreData)))
xvalResultsRFM <- tibble()

for (fold in 1:k) {
  trainDataFit <- moreData[kfInd != fold, ]
  trainDataVal <- moreData[kfInd == fold, ]

  rfmFit <- randomForest(select(trainDataFit, -yn), trainDataFit$yn, mtry = 14)
  
  rfmPredict <- predict(rfmFit, newdata = select(trainDataVal, -yn))
  rfmCM <- table(trainDataVal$yn, rfmPredict)

  xvalResultsRFM <- rbind(xvalResultsRFM, tibble(
    fold = fold,
    acc = calcAcc(rfmCM),
    sens = calcSensitivity(rfmCM),
    spec = calcSpecificity(rfmCM),
    model = "rf-m"
  ))
}
xvalResultsRFM %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = model, y = value, col = metric)) + geom_boxplot()

xvalResults <- bind_rows(
  xvalResultsLR,
  xvalResultsLRInt,
  xvalResultsRF,
  xvalResultsRFM,
  xvalResultsSVM,
  xvalResultsNN
)

xvalResults %>% gather(key = metric, value = value, -c(fold, model)) %>% 
  ggplot(aes(x = fct_reorder(model, value, .desc = TRUE), y = value, col = metric)) +
  geom_boxplot()

It appears that the random forest without the extra features still performs the best. With that settled, let's make some predictions.

rawTestData <- read_csv("final-data-test.csv") %>% mutate_if(is.character, as.factor)
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   cg = col_double(),
##   eg = col_character(),
##   re = col_character(),
##   ms = col_character(),
##   fw = col_double(),
##   cl = col_double(),
##   hw = col_double(),
##   ra = col_character(),
##   nc = col_double(),
##   se = col_character(),
##   ag = col_double(),
##   wc = col_character(),
##   ne = col_character(),
##   oc = col_character(),
##   dr = col_character()
## )
newcl <- predict(clLRFit, newdata = filter(rawTestData, cl == 0))

testData <- rawTestData %>%
  mutate(
    e2 = eg == "e2",
    sev = se == "vrm",
    clz = cl == 0,
    hw = log(hw + 1e-6)
    )
testData[testData$cl == 0, "cl"] <- newcl

# The test dataset has no yn column, so there is no outcome to drop here.
testDataDummies <- as.tibble(dummy.data.frame(as.data.frame(testData)))
rfFit <- randomForest(select(trainData, -yn), trainData$yn)
rfPredict <- predict(rfFit, newdata = testData)

submission <- testData %>% select(id) %>% mutate(GenghisKhan = rfPredict)
submission %>% write_csv("cs63_final_results.csv")

Problem 6c: get better than coin flip by 10% (4 points)

This is not really a problem per se but rather a criterion we will use in assessing the quality of your predictions for the test dataset. You get these four points if your predictions for the test dataset are better than those obtained from a fair coin flip (already shown in the leaderboard and as examples of the file format for predictions upload) by at least 10% on all four metrics shown in the leaderboard (accuracy, sensitivity, specificity and precision). Coin-flip predictions should not be very difficult to improve upon.
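For reference, the four leaderboard metrics follow directly from a 2x2 confusion matrix (the counts below are made up for illustration; rows are truth, columns are predictions, and “yes” is treated as the positive class):

```r
# Sketch: accuracy, sensitivity, specificity and precision from a 2x2 table.
cm <- matrix(c(50, 10,    # truth "no":  TN = 50, FP = 10
                5, 35),   # truth "yes": FN = 5,  TP = 35
             nrow = 2, byrow = TRUE,
             dimnames = list(truth = c("no", "yes"),
                             pred  = c("no", "yes")))
TN <- cm["no", "no"];  FP <- cm["no", "yes"]
FN <- cm["yes", "no"]; TP <- cm["yes", "yes"]

accuracy    <- (TP + TN) / sum(cm)  # 85 / 100 = 0.85
sensitivity <- TP / (TP + FN)       # 35 / 40  = 0.875
specificity <- TN / (TN + FP)       # 50 / 60  = 0.8333...
precision   <- TP / (TP + FP)       # 35 / 45  = 0.7777...
```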

An afterword on the computational demands of the final exam

Because previous offerings of this course always generated several posts on Piazza about how long it takes to fit various classifiers to the final exam dataset, we have added this note.

First of all, we most definitely do not expect you to have to buy capacity from AWS to complete this assignment. You certainly can if you want to, but this course is not about that, and this dataset is really not big enough to require it. Something reasonable/useful can be accomplished for this data with middle-of-the-road hardware. For instance, knitting the entire official solution for the final exam on an 8 GB RAM machine with two i5-7200U cores takes about an hour using single-threaded R/RStudio, and this includes both extra-points problems as well as various assessments of the performance of different models as a function of data size and so on.

Second, your solution should not take hours and hours to compile. If it does, it could be that it is attempting to do too much, or something is implemented inefficiently, or just plain incorrectly - it is impossible for us to comment on this until we see the code when we grade it. In general, it is often very prudent to “start small”: fit your model on a random subset of data small enough for the model fitting call to return immediately, check how model performance (both in terms of error and the time it takes to compute) scales with the size of the training data (as you increase it, say, two-fold several times), start tuning with a very coarse grid of parameter values and, given those results, decide what is right for you, etc.
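The “start small” loop described above can be sketched like this (glm on simulated data is a stand-in for whatever classifier you are actually timing; `simData` is made up for illustration):

```r
# Sketch: time a model fit as the training size doubles, to see how the
# cost of fitting scales before committing to the full dataset.
set.seed(1)
simData <- data.frame(x1 = rnorm(8000), x2 = rnorm(8000))
simData$yn <- factor(ifelse(simData$x1 + rnorm(8000) > 0, "yes", "no"))

for (n in c(1000, 2000, 4000, 8000)) {
  sub <- simData[sample(nrow(simData), n), ]
  t <- system.time(fit <- glm(yn ~ ., data = sub, family = binomial))["elapsed"]
  cat(sprintf("n = %4d  fit time = %.3f s\n", n, t))
}
```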

Lastly, making the decision about what is right for the problem at hand, how much is enough, etc. is inherent in this line of work. Choosing to conduct model tuning on a subset of the data - especially if you have some assessment of how the choice of tuning parameter and the test error are affected by the size of the training dataset - could be a very wise choice. If it is more efficient for you to knit each problem separately, by all means feel free to do that - just remember to submit every .Rmd and HTML file that comprises your entire solution. On that note, if you end up using any unorthodox setups for your calculations (e.g. AWS, parallel processing, multiple machines, etc. - none of which are essential for solving this correctly), please be sure that when we grade we have every relevant piece of code available - we won't be able to grade your work if we are not clear about how the results were obtained.

In the end, the final exam asks you to assess the performance of several classification technologies on a new dataset, decide which classifier is the best, and use it to make predictions for the test data. It is very much up to you how exactly you go about it. There could be many versions of a correct and informative solution (and just as many, if not more, that are completely wrong).

As always, best of luck - we are practically done here!